We have the following data from a beauty e-commerce store in Russia:
We perform an exploratory data analysis, including an RFM analysis. After that, we build a recommender system based on collaborative filtering.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme(style="whitegrid")
#Autocomplete
%config IPCompleter.greedy=True
#Remove scientific notation
pd.options.display.float_format = '{:15.2f}'.format
import sqlalchemy as sa
con = sa.create_engine('sqlite:///ecommerce.db')
from sqlalchemy import inspect
insp = inspect(con)
tablas = insp.get_table_names()
tablas
['2019-Dec', '2019-Nov', '2019-Oct', '2020-Feb', '2020-Jan']
oct = pd.read_sql('2019-Oct', con)
nov = pd.read_sql('2019-Nov', con)
dec = pd.read_sql('2019-Dec', con)
jan = pd.read_sql('2020-Jan', con)
feb = pd.read_sql('2020-Feb', con)
df = pd.concat([oct,nov,dec,jan,feb], axis = 0)
df
| | index | event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | 2019-10-01 00:01:46 UTC | view | 5843665 | 1487580005092295511 | None | f.o.x | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 |
| 1 | 72 | 2019-10-01 00:01:55 UTC | cart | 5868461 | 1487580013069861041 | None | italwax | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 |
| 2 | 95 | 2019-10-01 00:02:50 UTC | view | 5877456 | 1487580006300255120 | None | jessnail | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 |
| 3 | 122 | 2019-10-01 00:03:41 UTC | view | 5649270 | 1487580013749338323 | None | concept | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df |
| 4 | 124 | 2019-10-01 00:03:44 UTC | view | 18082 | 1487580005411062629 | None | cnd | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 429785 | 4156660 | 2020-02-29 23:58:49 UTC | cart | 5815662 | 1487580006317032337 | None | None | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 429786 | 4156663 | 2020-02-29 23:58:57 UTC | view | 5815665 | 1487580006317032337 | None | None | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 429787 | 4156668 | 2020-02-29 23:59:05 UTC | cart | 5815665 | 1487580006317032337 | None | None | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 429788 | 4156675 | 2020-02-29 23:59:28 UTC | view | 5817692 | 1487580010872045658 | None | None | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
| 429789 | 4156680 | 2020-02-29 23:59:54 UTC | view | 5716351 | 1487580010872045658 | None | irisk | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
2095076 rows × 10 columns
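The five per-month reads above could equally be written as a loop over the inspected table names. A minimal sketch against a hypothetical in-memory database (toy tables and values, not the real data):

```python
import pandas as pd
import sqlalchemy as sa

# Toy in-memory database standing in for ecommerce.db (hypothetical data)
con = sa.create_engine('sqlite://')
pd.DataFrame({'price': [9.44, 3.57]}).to_sql('2019-Oct', con, index=False)
pd.DataFrame({'price': [0.92]}).to_sql('2019-Nov', con, index=False)

# Read every table the inspector reports and stack them vertically
tables = sa.inspect(con).get_table_names()
df = pd.concat([pd.read_sql(t, con) for t in tables], ignore_index=True)
```

Using `ignore_index=True` avoids the duplicated row labels that `axis=0` concatenation otherwise produces.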
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2095076 entries, 0 to 429789
Data columns (total 10 columns):
 #   Column         Dtype
---  ------         -----
 0   index          int64
 1   event_time     object
 2   event_type     object
 3   product_id     int64
 4   category_id    int64
 5   category_code  object
 6   brand          object
 7   price          float64
 8   user_id        int64
 9   user_session   object
dtypes: float64(1), int64(4), object(5)
memory usage: 175.8+ MB
Delete the index column and convert event_time to a datetime format.
df.drop(columns = 'index', inplace = True)
df.event_time = pd.to_datetime(df.event_time)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2095076 entries, 0 to 429789
Data columns (total 9 columns):
 #   Column         Dtype
---  ------         -----
 0   event_time     datetime64[ns, UTC]
 1   event_type     object
 2   product_id     int64
 3   category_id    int64
 4   category_code  object
 5   brand          object
 6   price          float64
 7   user_id        int64
 8   user_session   object
dtypes: datetime64[ns, UTC](1), float64(1), int64(3), object(4)
memory usage: 159.8+ MB
df.isna().sum().sort_values(ascending = False)
category_code    2060411
brand             891646
user_session         506
event_time             0
event_type             0
product_id             0
category_id            0
price                  0
user_id                0
dtype: int64
Conclusions:

- category_code is missing in roughly 98% of records and brand in roughly 43%, so they carry little usable information.
- user_session is missing in only 506 records.

We are going to:

- drop the category_code and brand columns, and
- drop the few records with a missing user_session.
df = df.drop(columns = ['category_code','brand']).dropna()
df
| | event_time | event_type | product_id | category_id | price | user_id | user_session |
|---|---|---|---|---|---|---|---|
| 0 | 2019-10-01 00:01:46+00:00 | view | 5843665 | 1487580005092295511 | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 |
| 1 | 2019-10-01 00:01:55+00:00 | cart | 5868461 | 1487580013069861041 | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 |
| 2 | 2019-10-01 00:02:50+00:00 | view | 5877456 | 1487580006300255120 | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 |
| 3 | 2019-10-01 00:03:41+00:00 | view | 5649270 | 1487580013749338323 | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df |
| 4 | 2019-10-01 00:03:44+00:00 | view | 18082 | 1487580005411062629 | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 429785 | 2020-02-29 23:58:49+00:00 | cart | 5815662 | 1487580006317032337 | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 429786 | 2020-02-29 23:58:57+00:00 | view | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 429787 | 2020-02-29 23:59:05+00:00 | cart | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 429788 | 2020-02-29 23:59:28+00:00 | view | 5817692 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
| 429789 | 2020-02-29 23:59:54+00:00 | view | 5716351 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
2094570 rows × 7 columns
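The same cleanup can be expressed more generically by dropping any column below a completeness threshold before dropping the remaining incomplete rows. A sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with one mostly-empty column, like category_code/brand
toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [np.nan, np.nan, np.nan, 1.0],
                    'c': [1.0, np.nan, 3.0, 4.0]})
# Keep only columns that are at least half populated, then drop remaining NaN rows
cleaned = toy.dropna(axis=1, thresh=len(toy) // 2).dropna()
```

Here `thresh` is the minimum number of non-null values a column must have to survive.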
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| product_id | 2094570.00 | 5487103.56 | 1300923.90 | 3752.00 | 5724652.00 | 5811665.00 | 5858353.00 | 5932595.00 |
| category_id | 2094570.00 | 1553112489402986240.00 | 167907497920779328.00 | 1487580004807082752.00 | 1487580005754995456.00 | 1487580008246412288.00 | 1487580013489291520.00 | 2242903426784559104.00 |
| price | 2094570.00 | 8.42 | 19.14 | -47.62 | 2.05 | 4.00 | 6.86 | 327.78 |
| user_id | 2094570.00 | 521077545.56 | 87553855.76 | 4661182.00 | 480613387.00 | 553341613.00 | 578406571.00 | 622087993.00 |
The minimum price is negative, which should not be possible. We analyze those records.
df[df.price <= 0]
| | event_time | event_type | product_id | category_id | price | user_id | user_session |
|---|---|---|---|---|---|---|---|
| 343 | 2019-10-01 02:15:41+00:00 | view | 5892052 | 1487580010377117763 | 0.00 | 555455025 | 320f6021-30ac-4a58-ae17-bac1cc32aac3 |
| 924 | 2019-10-01 05:16:30+00:00 | view | 5889621 | 1487580010561667147 | 0.00 | 523988665 | 00849bd2-fcd2-4cb4-af31-4e264f151848 |
| 933 | 2019-10-01 05:18:03+00:00 | view | 5889622 | 1487580010561667147 | 0.00 | 523988665 | 80cfe614-f0a5-4101-a2b6-a21227590470 |
| 937 | 2019-10-01 05:18:46+00:00 | view | 5889623 | 1487580010561667147 | 0.00 | 523988665 | c2cd0464-3d2b-48e2-9667-bac248fe297a |
| 1077 | 2019-10-01 05:38:01+00:00 | view | 5889627 | 1487580010561667147 | 0.00 | 523988665 | 8b2bf9d8-43f0-43b2-bed3-13b2c956cada |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 428011 | 2020-02-29 20:04:49+00:00 | cart | 5824841 | 1897124478404526487 | 0.00 | 469761446 | 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46 |
| 428012 | 2020-02-29 20:04:49+00:00 | cart | 5826413 | 1487580005511725929 | 0.00 | 469761446 | 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46 |
| 428013 | 2020-02-29 20:04:49+00:00 | cart | 5832437 | 1487580007675986893 | 0.00 | 469761446 | 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46 |
| 428014 | 2020-02-29 20:04:49+00:00 | cart | 5851606 | 2055161088059638328 | 0.00 | 469761446 | 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46 |
| 428370 | 2020-02-29 20:26:16+00:00 | view | 5923106 | 1487580008246412266 | 0.00 | 622047714 | 74f04dc6-2b3c-4565-beda-f575d73ed81c |
20544 rows × 7 columns
There are about 20,000 such records. Do they belong to specific products?
df[df.price <= 0].product_id.value_counts().head(10)
5896186    79
5903915    50
5873428    37
5851294    29
5851304    29
5837624    28
5712583    27
5851272    27
5907812    26
5899512    26
Name: product_id, dtype: int64
The problem does not seem to be concentrated in specific products, so we delete all records with a non-positive price.
df = df[df.price > 0]
df
| | event_time | event_type | product_id | category_id | price | user_id | user_session |
|---|---|---|---|---|---|---|---|
| 0 | 2019-10-01 00:01:46+00:00 | view | 5843665 | 1487580005092295511 | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 |
| 1 | 2019-10-01 00:01:55+00:00 | cart | 5868461 | 1487580013069861041 | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 |
| 2 | 2019-10-01 00:02:50+00:00 | view | 5877456 | 1487580006300255120 | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 |
| 3 | 2019-10-01 00:03:41+00:00 | view | 5649270 | 1487580013749338323 | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df |
| 4 | 2019-10-01 00:03:44+00:00 | view | 18082 | 1487580005411062629 | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 429785 | 2020-02-29 23:58:49+00:00 | cart | 5815662 | 1487580006317032337 | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 429786 | 2020-02-29 23:58:57+00:00 | view | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 429787 | 2020-02-29 23:59:05+00:00 | cart | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 429788 | 2020-02-29 23:59:28+00:00 | view | 5817692 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
| 429789 | 2020-02-29 23:59:54+00:00 | view | 5716351 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
2074026 rows × 7 columns
df.event_type.nunique()
4
df.event_type.value_counts()
view                961558
cart                574547
remove_from_cart    410357
purchase            127564
Name: event_type, dtype: int64
df.product_id.nunique()
45327
df.category_id.nunique()
508
df.set_index('event_time', inplace = True)
df
| | event_type | product_id | category_id | price | user_id | user_session |
|---|---|---|---|---|---|---|
| event_time | ||||||
| 2019-10-01 00:01:46+00:00 | view | 5843665 | 1487580005092295511 | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 |
| 2019-10-01 00:01:55+00:00 | cart | 5868461 | 1487580013069861041 | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 |
| 2019-10-01 00:02:50+00:00 | view | 5877456 | 1487580006300255120 | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 |
| 2019-10-01 00:03:41+00:00 | view | 5649270 | 1487580013749338323 | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df |
| 2019-10-01 00:03:44+00:00 | view | 18082 | 1487580005411062629 | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 |
| ... | ... | ... | ... | ... | ... | ... |
| 2020-02-29 23:58:49+00:00 | cart | 5815662 | 1487580006317032337 | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 2020-02-29 23:58:57+00:00 | view | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 2020-02-29 23:59:05+00:00 | cart | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
| 2020-02-29 23:59:28+00:00 | view | 5817692 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
| 2020-02-29 23:59:54+00:00 | view | 5716351 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
2074026 rows × 6 columns
We are going to create three types of new variables: calendar components of the timestamp, a holiday indicator, and indicators for special shopping days.
def components_date(dataframe):
date = dataframe.index.date
year = dataframe.index.year
month = dataframe.index.month
day = dataframe.index.day
hour = dataframe.index.hour
return(pd.DataFrame({'date':date, 'year':year,'month':month, 'day':day, 'hour':hour}))
df = pd.concat([df.reset_index(),components_date(df)], axis = 1).set_index('event_time')
df
| | event_type | product_id | category_id | price | user_id | user_session | date | year | month | day | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| event_time | |||||||||||
| 2019-10-01 00:01:46+00:00 | view | 5843665 | 1487580005092295511 | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 | 2019-10-01 | 2019 | 10 | 1 | 0 |
| 2019-10-01 00:01:55+00:00 | cart | 5868461 | 1487580013069861041 | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 | 2019-10-01 | 2019 | 10 | 1 | 0 |
| 2019-10-01 00:02:50+00:00 | view | 5877456 | 1487580006300255120 | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 | 2019-10-01 | 2019 | 10 | 1 | 0 |
| 2019-10-01 00:03:41+00:00 | view | 5649270 | 1487580013749338323 | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df | 2019-10-01 | 2019 | 10 | 1 | 0 |
| 2019-10-01 00:03:44+00:00 | view | 18082 | 1487580005411062629 | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 | 2019-10-01 | 2019 | 10 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-02-29 23:58:49+00:00 | cart | 5815662 | 1487580006317032337 | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 2020-02-29 | 2020 | 2 | 29 | 23 |
| 2020-02-29 23:58:57+00:00 | view | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 2020-02-29 | 2020 | 2 | 29 | 23 |
| 2020-02-29 23:59:05+00:00 | cart | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 2020-02-29 | 2020 | 2 | 29 | 23 |
| 2020-02-29 23:59:28+00:00 | view | 5817692 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d | 2020-02-29 | 2020 | 2 | 29 | 23 |
| 2020-02-29 23:59:54+00:00 | view | 5716351 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d | 2020-02-29 | 2020 | 2 | 29 | 23 |
2074026 rows × 11 columns
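The same components can also be extracted with the `.dt` accessor on a datetime Series; a quick check on two hypothetical timestamps:

```python
import pandas as pd

# Two hypothetical timestamps; the .dt accessor extracts the same components
times = pd.Series(pd.to_datetime(
    ['2019-10-01 00:01:46', '2020-02-29 23:59:54'], utc=True))
parts = pd.DataFrame({'date': times.dt.date,
                      'year': times.dt.year,
                      'month': times.dt.month,
                      'day': times.dt.day,
                      'hour': times.dt.hour})
```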
Since the e-commerce store operates in Russia, we leverage the holidays package to import the local Russian holidays.
import holidays
holiday_ru = holidays.RU(years=2020)
holiday_ru
{datetime.date(2020, 1, 1): 'Новый год', datetime.date(2020, 1, 2): 'Новый год', datetime.date(2020, 1, 3): 'Новый год', datetime.date(2020, 1, 4): 'Новый год', datetime.date(2020, 1, 5): 'Новый год', datetime.date(2020, 1, 6): 'Новый год', datetime.date(2020, 1, 7): 'Православное Рождество', datetime.date(2020, 1, 8): 'Новый год', datetime.date(2020, 2, 23): 'День защитника отечества', datetime.date(2020, 3, 8): 'День женщин', datetime.date(2020, 5, 1): 'Праздник Весны и Труда', datetime.date(2020, 5, 9): 'День Победы', datetime.date(2020, 6, 12): 'День России', datetime.date(2020, 11, 4): 'День народного единства', datetime.date(2020, 12, 31): 'Новый год'}
We include a variable indicating whether each record occurred on a holiday. (A holidays object populates additional years on demand when a date is looked up, which is why 2019 dates are matched below as well.)
df['holiday'] = df.date.apply(lambda x: 1 if (x in holiday_ru) else 0)
df
| | event_type | product_id | category_id | price | user_id | user_session | date | year | month | day | hour | holiday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| event_time | ||||||||||||
| 2019-10-01 00:01:46+00:00 | view | 5843665 | 1487580005092295511 | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 |
| 2019-10-01 00:01:55+00:00 | cart | 5868461 | 1487580013069861041 | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 |
| 2019-10-01 00:02:50+00:00 | view | 5877456 | 1487580006300255120 | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 |
| 2019-10-01 00:03:41+00:00 | view | 5649270 | 1487580013749338323 | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 |
| 2019-10-01 00:03:44+00:00 | view | 18082 | 1487580005411062629 | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-02-29 23:58:49+00:00 | cart | 5815662 | 1487580006317032337 | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 2020-02-29 | 2020 | 2 | 29 | 23 | 0 |
| 2020-02-29 23:58:57+00:00 | view | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 2020-02-29 | 2020 | 2 | 29 | 23 | 0 |
| 2020-02-29 23:59:05+00:00 | cart | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 2020-02-29 | 2020 | 2 | 29 | 23 | 0 |
| 2020-02-29 23:59:28+00:00 | view | 5817692 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d | 2020-02-29 | 2020 | 2 | 29 | 23 | 0 |
| 2020-02-29 23:59:54+00:00 | view | 5716351 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d | 2020-02-29 | 2020 | 2 | 29 | 23 | 0 |
2074026 rows × 12 columns
Check
df[df.holiday == 1].date.value_counts().sort_index()
2019-11-04    16430
2019-12-31     2848
2020-01-01     7644
2020-01-02    10776
2020-01-03    10617
2020-01-04    13084
2020-01-05    14554
2020-01-06    10621
2020-01-07    12922
2020-01-08    14004
2020-02-23     9817
Name: date, dtype: int64
We will add indicators for Black Friday and Valentine's Day.
df['black_friday'] = 0
df.loc['2019-11-29','black_friday'] = 1
df['valentin'] = 0
df.loc['2020-02-14','valentin'] = 1
Check
df['black_friday'].value_counts()
0    2051695
1      22331
Name: black_friday, dtype: int64
df['valentin'].value_counts()
0    2061781
1      12245
Name: valentin, dtype: int64
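The day-level assignments above work because a partial date string used with `.loc` on a `DatetimeIndex` selects every timestamp falling on that day. A minimal sketch on a hypothetical half-day grid:

```python
import pandas as pd

# Toy half-day grid around 2019-11-29 (hypothetical timestamps)
idx = pd.date_range('2019-11-28', periods=4, freq='12h')
toy = pd.DataFrame({'black_friday': 0}, index=idx)
# A day-level string label selects every timestamp that falls on that day
toy.loc['2019-11-29', 'black_friday'] = 1
```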
df.head()
| | event_type | product_id | category_id | price | user_id | user_session | date | year | month | day | hour | holiday | black_friday | valentin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| event_time | ||||||||||||||
| 2019-10-01 00:01:46+00:00 | view | 5843665 | 1487580005092295511 | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 | 0 | 0 |
| 2019-10-01 00:01:55+00:00 | cart | 5868461 | 1487580013069861041 | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 | 0 | 0 |
| 2019-10-01 00:02:50+00:00 | view | 5877456 | 1487580006300255120 | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 | 0 | 0 |
| 2019-10-01 00:03:41+00:00 | view | 5649270 | 1487580013749338323 | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 | 0 | 0 |
| 2019-10-01 00:03:44+00:00 | view | 18082 | 1487580005411062629 | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 | 0 | 0 |
df.to_pickle('../datamart1.pickle')
df = pd.read_pickle('../datamart1.pickle')
We analyze the possible events:
events = df.event_type.value_counts()
events
view                961558
cart                574547
remove_from_cart    410357
purchase            127564
Name: event_type, dtype: int64
kpi_view_p = 100
kpi_cart_p = events.loc['cart'] / events.loc['view'] * 100
kpi_removed_p = events.loc['remove_from_cart'] / events.loc['cart'] * 100
kpi_purchase_p = events.loc['purchase'] / events.loc['cart'] * 100
kpis = pd.DataFrame({'kpi':['views','cart','purchase'],
'value':[kpi_view_p,kpi_cart_p,kpi_purchase_p]})
kpis
| | kpi | value |
|---|---|---|
| 0 | views | 100.00 |
| 1 | cart | 59.75 |
| 2 | purchase | 22.20 |
from plotly import graph_objects as go
fig = go.Figure(go.Funnel(
y = kpis.kpi,
x = kpis.value.round(2),
marker = {'color': ['red','blue','green']},
opacity = 0.3
))
fig.update_layout(
title = 'Funnel Conversion')
fig.show()
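The step-to-step percentages feeding the funnel can be reproduced with plain Python from the event counts above:

```python
# Event counts taken from the value_counts output above
steps = {'view': 961558, 'cart': 574547, 'purchase': 127564}
counts = list(steps.values())
# Each stage expressed as a percentage of the previous one
conversion = [100.0] + [round(b / a * 100, 2)
                        for a, b in zip(counts, counts[1:])]
# conversion → [100.0, 59.75, 22.2]
```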
Conclusions:

- Roughly 60% of product views lead to a cart addition.
- Only about 22% of cart additions convert into a purchase, while about 71% of cart additions are later removed.
We aggregate the data to the session and event-type level.
session_prod = df.groupby(['user_session','event_type']).product_id.count()
session_prod
user_session event_type
0000597b-de39-4a77-9fe5-02c8792ca14e view 3
0000645a-8160-4a3d-91bf-154bff0a22e3 view 2
000090e1-da13-42b1-a31b-91a9ee5e6a88 view 1
0000b3cb-5422-4bf2-b8fe-5c1831d0dc1b view 1
0000de26-bd58-42c9-9173-4763c76b398e view 1
..
ffff6695-b64d-4a67-aa14-34b3b7f63c3f view 2
ffff7d69-b706-4c64-9d6d-da57a04bc32b view 1
ffff8044-2a22-4846-8a72-999e870abbe9 view 1
ffff91d4-7879-4a4b-8b26-c67915a27dc8 view 1
ffffbe0a-d2c2-47c7-afab-680bfdfda50d view 1
Name: product_id, Length: 581763, dtype: int64
We move the events to columns.
session_prod = session_prod.unstack().fillna(0)
session_prod
| event_type | cart | purchase | remove_from_cart | view |
|---|---|---|---|---|
| user_session | ||||
| 0000597b-de39-4a77-9fe5-02c8792ca14e | 0.00 | 0.00 | 0.00 | 3.00 |
| 0000645a-8160-4a3d-91bf-154bff0a22e3 | 0.00 | 0.00 | 0.00 | 2.00 |
| 000090e1-da13-42b1-a31b-91a9ee5e6a88 | 0.00 | 0.00 | 0.00 | 1.00 |
| 0000b3cb-5422-4bf2-b8fe-5c1831d0dc1b | 0.00 | 0.00 | 0.00 | 1.00 |
| 0000de26-bd58-42c9-9173-4763c76b398e | 0.00 | 0.00 | 0.00 | 1.00 |
| ... | ... | ... | ... | ... |
| ffff6695-b64d-4a67-aa14-34b3b7f63c3f | 0.00 | 0.00 | 0.00 | 2.00 |
| ffff7d69-b706-4c64-9d6d-da57a04bc32b | 0.00 | 0.00 | 0.00 | 1.00 |
| ffff8044-2a22-4846-8a72-999e870abbe9 | 0.00 | 0.00 | 0.00 | 1.00 |
| ffff91d4-7879-4a4b-8b26-c67915a27dc8 | 0.00 | 0.00 | 0.00 | 1.00 |
| ffffbe0a-d2c2-47c7-afab-680bfdfda50d | 0.00 | 0.00 | 0.00 | 1.00 |
446054 rows × 4 columns
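The count-then-unstack pivot can be verified on a tiny hypothetical log:

```python
import pandas as pd

# Hypothetical three-event log
toy = pd.DataFrame({'user_session': ['a', 'a', 'b'],
                    'event_type': ['view', 'cart', 'view'],
                    'product_id': [101, 102, 103]})
# Count events per session/type, then pivot the event types into columns
counts = toy.groupby(['user_session', 'event_type']).product_id.count()
wide = counts.unstack().fillna(0)
```

Missing session/event combinations become NaN after `unstack`, which is why the `fillna(0)` is needed.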
We compute the average of each event per session.
avg_events_session = session_prod.mean()
avg_events_session
event_type
cart                1.29
purchase            0.29
remove_from_cart    0.92
view                2.16
dtype: float64
Conclusion:

In each session, on average:

- 2.16 products are viewed,
- 1.29 products are added to the cart,
- 0.92 products are removed from the cart,
- 0.29 products are purchased.
We aggregate the data to the event-type and hour level.
events_hour = df.groupby(['event_type','hour']).product_id.count()
events_hour
event_type hour
cart 0 6475
1 5555
2 6433
3 8544
4 11242
...
view 19 63730
20 57311
21 38905
22 23043
23 13307
Name: product_id, Length: 96, dtype: int64
We move the events to the columns.
events_hour = events_hour.unstack(level = 0)
events_hour
| event_type | cart | purchase | remove_from_cart | view |
|---|---|---|---|---|
| hour | ||||
| 0 | 6475 | 962 | 3238 | 8731 |
| 1 | 5555 | 1128 | 3930 | 7280 |
| 2 | 6433 | 1220 | 3509 | 8378 |
| 3 | 8544 | 1535 | 5331 | 11807 |
| 4 | 11242 | 2389 | 8095 | 18365 |
| 5 | 16890 | 3491 | 11913 | 27438 |
| 6 | 21993 | 5125 | 16223 | 38055 |
| 7 | 27069 | 5951 | 17883 | 46072 |
| 8 | 29526 | 7158 | 21156 | 49587 |
| 9 | 32095 | 7593 | 21680 | 54185 |
| 10 | 32901 | 7816 | 23982 | 56458 |
| 11 | 33284 | 8495 | 25496 | 57594 |
| 12 | 34258 | 8250 | 23714 | 57530 |
| 13 | 31996 | 8133 | 22852 | 55534 |
| 14 | 30451 | 7122 | 21835 | 52184 |
| 15 | 28789 | 6485 | 20162 | 49809 |
| 16 | 28775 | 6531 | 19791 | 51055 |
| 17 | 32525 | 6242 | 24330 | 55667 |
| 18 | 36435 | 8211 | 30551 | 59533 |
| 19 | 39609 | 7435 | 27666 | 63730 |
| 20 | 34828 | 7256 | 24985 | 57311 |
| 21 | 23228 | 4606 | 17396 | 38905 |
| 22 | 13589 | 2883 | 8680 | 23043 |
| 23 | 8057 | 1547 | 5959 | 13307 |
Visualize how the events are distributed per hour.
events_hour.plot()
plt.xticks(ticks = events_hour.index);
There is the expected global pattern: activity is low overnight, ramps up through the morning, and peaks around midday and again in the evening (around 19:00).
In order to explore the differences between hours, we create a new variable: the ratio of purchases to views in each hour.
events_hour['purchase_visits'] = events_hour.purchase / events_hour.view * 100
events_hour
| event_type | cart | purchase | remove_from_cart | view | purchase_visits |
|---|---|---|---|---|---|
| hour | |||||
| 0 | 6475 | 962 | 3238 | 8731 | 11.02 |
| 1 | 5555 | 1128 | 3930 | 7280 | 15.49 |
| 2 | 6433 | 1220 | 3509 | 8378 | 14.56 |
| 3 | 8544 | 1535 | 5331 | 11807 | 13.00 |
| 4 | 11242 | 2389 | 8095 | 18365 | 13.01 |
| 5 | 16890 | 3491 | 11913 | 27438 | 12.72 |
| 6 | 21993 | 5125 | 16223 | 38055 | 13.47 |
| 7 | 27069 | 5951 | 17883 | 46072 | 12.92 |
| 8 | 29526 | 7158 | 21156 | 49587 | 14.44 |
| 9 | 32095 | 7593 | 21680 | 54185 | 14.01 |
| 10 | 32901 | 7816 | 23982 | 56458 | 13.84 |
| 11 | 33284 | 8495 | 25496 | 57594 | 14.75 |
| 12 | 34258 | 8250 | 23714 | 57530 | 14.34 |
| 13 | 31996 | 8133 | 22852 | 55534 | 14.65 |
| 14 | 30451 | 7122 | 21835 | 52184 | 13.65 |
| 15 | 28789 | 6485 | 20162 | 49809 | 13.02 |
| 16 | 28775 | 6531 | 19791 | 51055 | 12.79 |
| 17 | 32525 | 6242 | 24330 | 55667 | 11.21 |
| 18 | 36435 | 8211 | 30551 | 59533 | 13.79 |
| 19 | 39609 | 7435 | 27666 | 63730 | 11.67 |
| 20 | 34828 | 7256 | 24985 | 57311 | 12.66 |
| 21 | 23228 | 4606 | 17396 | 38905 | 11.84 |
| 22 | 13589 | 2883 | 8680 | 23043 | 12.51 |
| 23 | 8057 | 1547 | 5959 | 13307 | 11.63 |
Visualize to see if there are hours when proportionally more is purchased.
plt.figure(figsize = (12,6))
sns.lineplot(data = events_hour, x = events_hour.index, y = 'purchase_visits')
plt.xticks(events_hour.index);
Conclusions:

- The purchase-to-view ratio stays within roughly 11–15.5% throughout the day.
- It is proportionally highest in the very early morning (around 01:00–02:00) and from late morning to early afternoon (11:00–13:00), and lowest around midnight, 17:00, and in the late evening.

We also compute the average monthly revenue from purchases:
df.loc[df.event_type == 'purchase'].groupby('month').price.sum().mean()
124309.92
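The monthly-revenue average follows a filter-group-aggregate pattern; a sketch on hypothetical purchases:

```python
import pandas as pd

# Hypothetical purchases across two months
toy = pd.DataFrame({'event_type': ['purchase', 'view', 'purchase'],
                    'month': [1, 1, 2],
                    'price': [10.0, 5.0, 20.0]})
# Revenue per month from purchase events only, then its mean
monthly = toy.loc[toy.event_type == 'purchase'].groupby('month').price.sum()
avg_monthly = monthly.mean()  # (10.0 + 20.0) / 2 = 15.0
```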
trend = df.groupby('event_type').resample('W').event_type.count().unstack(level = 0)
trend
| event_type | cart | purchase | remove_from_cart | view |
|---|---|---|---|---|
| event_time | ||||
| 2019-10-06 00:00:00+00:00 | 31483 | 4440 | 14647 | 36353 |
| 2019-10-13 00:00:00+00:00 | 28151 | 5422 | 17989 | 44410 |
| 2019-10-20 00:00:00+00:00 | 23920 | 5033 | 15303 | 39486 |
| 2019-10-27 00:00:00+00:00 | 25651 | 5665 | 18411 | 40383 |
| 2019-11-03 00:00:00+00:00 | 24087 | 5746 | 16491 | 39365 |
| 2019-11-10 00:00:00+00:00 | 29142 | 6663 | 24008 | 46177 |
| 2019-11-17 00:00:00+00:00 | 25335 | 5141 | 17215 | 41170 |
| 2019-11-24 00:00:00+00:00 | 38069 | 9754 | 27973 | 56477 |
| 2019-12-01 00:00:00+00:00 | 31994 | 7493 | 23106 | 48883 |
| 2019-12-08 00:00:00+00:00 | 23265 | 5105 | 19443 | 42055 |
| 2019-12-15 00:00:00+00:00 | 24636 | 5953 | 18246 | 45874 |
| 2019-12-22 00:00:00+00:00 | 19927 | 4701 | 15452 | 39237 |
| 2019-12-29 00:00:00+00:00 | 17051 | 3705 | 11102 | 32803 |
| 2020-01-05 00:00:00+00:00 | 16735 | 3294 | 13464 | 31909 |
| 2020-01-12 00:00:00+00:00 | 26264 | 5589 | 17956 | 46873 |
| 2020-01-19 00:00:00+00:00 | 28402 | 6913 | 22945 | 50210 |
| 2020-01-26 00:00:00+00:00 | 26353 | 6359 | 18544 | 48478 |
| 2020-02-02 00:00:00+00:00 | 29193 | 7120 | 21102 | 52432 |
| 2020-02-09 00:00:00+00:00 | 28796 | 5853 | 20050 | 48422 |
| 2020-02-16 00:00:00+00:00 | 27836 | 6332 | 22601 | 47213 |
| 2020-02-23 00:00:00+00:00 | 25619 | 6000 | 18146 | 43627 |
| 2020-03-01 00:00:00+00:00 | 22638 | 5283 | 16163 | 39721 |
trend = trend[['view','cart','remove_from_cart','purchase']]
trend
| event_type | view | cart | remove_from_cart | purchase |
|---|---|---|---|---|
| event_time | ||||
| 2019-10-06 00:00:00+00:00 | 36353 | 31483 | 14647 | 4440 |
| 2019-10-13 00:00:00+00:00 | 44410 | 28151 | 17989 | 5422 |
| 2019-10-20 00:00:00+00:00 | 39486 | 23920 | 15303 | 5033 |
| 2019-10-27 00:00:00+00:00 | 40383 | 25651 | 18411 | 5665 |
| 2019-11-03 00:00:00+00:00 | 39365 | 24087 | 16491 | 5746 |
| 2019-11-10 00:00:00+00:00 | 46177 | 29142 | 24008 | 6663 |
| 2019-11-17 00:00:00+00:00 | 41170 | 25335 | 17215 | 5141 |
| 2019-11-24 00:00:00+00:00 | 56477 | 38069 | 27973 | 9754 |
| 2019-12-01 00:00:00+00:00 | 48883 | 31994 | 23106 | 7493 |
| 2019-12-08 00:00:00+00:00 | 42055 | 23265 | 19443 | 5105 |
| 2019-12-15 00:00:00+00:00 | 45874 | 24636 | 18246 | 5953 |
| 2019-12-22 00:00:00+00:00 | 39237 | 19927 | 15452 | 4701 |
| 2019-12-29 00:00:00+00:00 | 32803 | 17051 | 11102 | 3705 |
| 2020-01-05 00:00:00+00:00 | 31909 | 16735 | 13464 | 3294 |
| 2020-01-12 00:00:00+00:00 | 46873 | 26264 | 17956 | 5589 |
| 2020-01-19 00:00:00+00:00 | 50210 | 28402 | 22945 | 6913 |
| 2020-01-26 00:00:00+00:00 | 48478 | 26353 | 18544 | 6359 |
| 2020-02-02 00:00:00+00:00 | 52432 | 29193 | 21102 | 7120 |
| 2020-02-09 00:00:00+00:00 | 48422 | 28796 | 20050 | 5853 |
| 2020-02-16 00:00:00+00:00 | 47213 | 27836 | 22601 | 6332 |
| 2020-02-23 00:00:00+00:00 | 43627 | 25619 | 18146 | 6000 |
| 2020-03-01 00:00:00+00:00 | 39721 | 22638 | 16163 | 5283 |
trend.plot(subplots = True, figsize = (12,6), sharex = True, xticks = trend.index, x_compat=True, rot = 90);
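The weekly aggregation combines a groupby over event types with a time-based resample; the mechanics can be checked on a hypothetical three-event series:

```python
import pandas as pd

# Hypothetical events spanning two calendar weeks
idx = pd.to_datetime(['2019-10-01', '2019-10-02', '2019-10-09'], utc=True)
toy = pd.DataFrame({'event_type': ['view', 'cart', 'view']}, index=idx)
# Per event type, count events in weekly bins, then pivot types into columns
weekly = (toy.groupby('event_type').resample('W')
             .event_type.count().unstack(level=0))
```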
There is a pronounced peak in the week ending November 24th, most likely driven by the run-up to Black Friday (November 29th). We repeat the analysis at daily granularity, restricted to November and December, to see the effect.
trend_daily = df.loc['2019-11':'2019-12'].groupby('event_type').resample('D').event_type.count().unstack(level = 0)
trend_daily
| event_type | cart | purchase | remove_from_cart | view |
|---|---|---|---|---|
| event_time | ||||
| 2019-11-01 00:00:00+00:00 | 3565 | 709 | 2810 | 5352 |
| 2019-11-02 00:00:00+00:00 | 3015 | 912 | 2124 | 4857 |
| 2019-11-03 00:00:00+00:00 | 3540 | 755 | 2622 | 5583 |
| 2019-11-04 00:00:00+00:00 | 4652 | 676 | 4854 | 6248 |
| 2019-11-05 00:00:00+00:00 | 4118 | 753 | 2711 | 7213 |
| ... | ... | ... | ... | ... |
| 2019-12-27 00:00:00+00:00 | 2023 | 507 | 1335 | 4058 |
| 2019-12-28 00:00:00+00:00 | 1744 | 329 | 1193 | 3704 |
| 2019-12-29 00:00:00+00:00 | 2134 | 263 | 1149 | 3939 |
| 2019-12-30 00:00:00+00:00 | 1364 | 258 | 823 | 3434 |
| 2019-12-31 00:00:00+00:00 | 563 | 114 | 447 | 1724 |
61 rows × 4 columns
trend_daily = trend_daily[['view','cart','remove_from_cart','purchase']]
trend_daily
| event_type | view | cart | remove_from_cart | purchase |
|---|---|---|---|---|
| event_time | ||||
| 2019-11-01 00:00:00+00:00 | 5352 | 3565 | 2810 | 709 |
| 2019-11-02 00:00:00+00:00 | 4857 | 3015 | 2124 | 912 |
| 2019-11-03 00:00:00+00:00 | 5583 | 3540 | 2622 | 755 |
| 2019-11-04 00:00:00+00:00 | 6248 | 4652 | 4854 | 676 |
| 2019-11-05 00:00:00+00:00 | 7213 | 4118 | 2711 | 753 |
| ... | ... | ... | ... | ... |
| 2019-12-27 00:00:00+00:00 | 4058 | 2023 | 1335 | 507 |
| 2019-12-28 00:00:00+00:00 | 3704 | 1744 | 1193 | 329 |
| 2019-12-29 00:00:00+00:00 | 3939 | 2134 | 1149 | 263 |
| 2019-12-30 00:00:00+00:00 | 3434 | 1364 | 823 | 258 |
| 2019-12-31 00:00:00+00:00 | 1724 | 563 | 447 | 114 |
61 rows × 4 columns
trend_daily.plot(subplots = True, figsize = (16,10), sharex = True, xticks = trend_daily.index, x_compat=True, rot = 90);
Conclusions:

- Daily activity declines steadily over the second half of December, bottoming out on December 31st.
We try to identify the day-hour combinations in which the highest number of purchases occurs. It would be very useful to concentrate much of the campaign investment at precisely those times.
purchases_day_hour = df.loc[df.event_type == 'purchase'].groupby(['date','hour']).event_type.count().unstack(level = 0).fillna(0)
purchases_day_hour
| date | 2019-10-01 | 2019-10-02 | 2019-10-03 | 2019-10-04 | 2019-10-05 | 2019-10-06 | 2019-10-07 | 2019-10-08 | 2019-10-09 | 2019-10-10 | ... | 2020-02-20 | 2020-02-21 | 2020-02-22 | 2020-02-23 | 2020-02-24 | 2020-02-25 | 2020-02-26 | 2020-02-27 | 2020-02-28 | 2020-02-29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hour | |||||||||||||||||||||
| 0 | 13.00 | 18.00 | 1.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | 23.00 | ... | 28.00 | 0.00 | 0.00 | 0.00 | 0.00 | 55.00 | 5.00 | 40.00 | 0.00 | 0.00 |
| 1 | 0.00 | 0.00 | 5.00 | 0.00 | 4.00 | 24.00 | 3.00 | 0.00 | 16.00 | 0.00 | ... | 0.00 | 5.00 | 0.00 | 2.00 | 21.00 | 11.00 | 0.00 | 5.00 | 26.00 | 33.00 |
| 2 | 0.00 | 0.00 | 0.00 | 24.00 | 0.00 | 0.00 | 0.00 | 18.00 | 4.00 | 18.00 | ... | 9.00 | 0.00 | 0.00 | 10.00 | 0.00 | 34.00 | 0.00 | 0.00 | 0.00 | 8.00 |
| 3 | 0.00 | 24.00 | 10.00 | 0.00 | 0.00 | 26.00 | 2.00 | 20.00 | 16.00 | 55.00 | ... | 0.00 | 14.00 | 13.00 | 0.00 | 5.00 | 15.00 | 0.00 | 10.00 | 0.00 | 0.00 |
| 4 | 15.00 | 0.00 | 45.00 | 27.00 | 3.00 | 24.00 | 12.00 | 46.00 | 56.00 | 0.00 | ... | 6.00 | 22.00 | 34.00 | 17.00 | 7.00 | 16.00 | 10.00 | 148.00 | 16.00 | 0.00 |
| 5 | 49.00 | 9.00 | 6.00 | 17.00 | 1.00 | 14.00 | 21.00 | 10.00 | 0.00 | 43.00 | ... | 23.00 | 8.00 | 28.00 | 0.00 | 14.00 | 103.00 | 6.00 | 48.00 | 11.00 | 3.00 |
| 6 | 23.00 | 34.00 | 18.00 | 10.00 | 10.00 | 13.00 | 38.00 | 35.00 | 14.00 | 12.00 | ... | 7.00 | 22.00 | 5.00 | 46.00 | 26.00 | 20.00 | 94.00 | 26.00 | 58.00 | 35.00 |
| 7 | 26.00 | 60.00 | 26.00 | 54.00 | 58.00 | 20.00 | 63.00 | 27.00 | 26.00 | 59.00 | ... | 98.00 | 67.00 | 52.00 | 10.00 | 30.00 | 23.00 | 30.00 | 53.00 | 38.00 | 65.00 |
| 8 | 28.00 | 71.00 | 129.00 | 49.00 | 20.00 | 37.00 | 27.00 | 41.00 | 89.00 | 62.00 | ... | 30.00 | 95.00 | 35.00 | 52.00 | 30.00 | 17.00 | 120.00 | 80.00 | 67.00 | 25.00 |
| 9 | 24.00 | 34.00 | 90.00 | 61.00 | 44.00 | 8.00 | 56.00 | 43.00 | 28.00 | 78.00 | ... | 76.00 | 32.00 | 58.00 | 27.00 | 68.00 | 21.00 | 38.00 | 92.00 | 20.00 | 22.00 |
| 10 | 15.00 | 62.00 | 43.00 | 22.00 | 58.00 | 37.00 | 52.00 | 14.00 | 65.00 | 57.00 | ... | 31.00 | 45.00 | 10.00 | 44.00 | 55.00 | 88.00 | 66.00 | 29.00 | 64.00 | 19.00 |
| 11 | 95.00 | 80.00 | 83.00 | 36.00 | 33.00 | 44.00 | 53.00 | 66.00 | 47.00 | 13.00 | ... | 75.00 | 54.00 | 28.00 | 58.00 | 42.00 | 19.00 | 80.00 | 69.00 | 86.00 | 63.00 |
| 12 | 9.00 | 43.00 | 100.00 | 67.00 | 75.00 | 37.00 | 58.00 | 48.00 | 80.00 | 18.00 | ... | 107.00 | 71.00 | 48.00 | 38.00 | 38.00 | 82.00 | 52.00 | 57.00 | 40.00 | 58.00 |
| 13 | 16.00 | 76.00 | 69.00 | 18.00 | 31.00 | 40.00 | 44.00 | 109.00 | 11.00 | 67.00 | ... | 24.00 | 93.00 | 10.00 | 21.00 | 45.00 | 49.00 | 50.00 | 59.00 | 6.00 | 23.00 |
| 14 | 74.00 | 31.00 | 38.00 | 36.00 | 39.00 | 12.00 | 44.00 | 22.00 | 142.00 | 44.00 | ... | 34.00 | 34.00 | 66.00 | 47.00 | 27.00 | 34.00 | 70.00 | 51.00 | 26.00 | 44.00 |
| 15 | 25.00 | 10.00 | 45.00 | 28.00 | 128.00 | 116.00 | 9.00 | 31.00 | 93.00 | 39.00 | ... | 37.00 | 11.00 | 49.00 | 44.00 | 11.00 | 31.00 | 51.00 | 28.00 | 44.00 | 46.00 |
| 16 | 99.00 | 21.00 | 33.00 | 42.00 | 49.00 | 24.00 | 24.00 | 40.00 | 8.00 | 35.00 | ... | 74.00 | 196.00 | 6.00 | 4.00 | 72.00 | 82.00 | 55.00 | 10.00 | 14.00 | 59.00 |
| 17 | 88.00 | 80.00 | 55.00 | 31.00 | 8.00 | 1.00 | 53.00 | 34.00 | 10.00 | 8.00 | ... | 26.00 | 46.00 | 37.00 | 9.00 | 123.00 | 26.00 | 6.00 | 32.00 | 27.00 | 34.00 |
| 18 | 53.00 | 24.00 | 35.00 | 54.00 | 8.00 | 13.00 | 65.00 | 109.00 | 10.00 | 39.00 | ... | 161.00 | 56.00 | 11.00 | 37.00 | 19.00 | 42.00 | 98.00 | 220.00 | 46.00 | 55.00 |
| 19 | 29.00 | 25.00 | 19.00 | 14.00 | 31.00 | 47.00 | 97.00 | 70.00 | 87.00 | 46.00 | ... | 21.00 | 11.00 | 39.00 | 22.00 | 30.00 | 44.00 | 28.00 | 85.00 | 56.00 | 21.00 |
| 20 | 53.00 | 22.00 | 63.00 | 17.00 | 8.00 | 15.00 | 142.00 | 34.00 | 18.00 | 44.00 | ... | 41.00 | 38.00 | 8.00 | 37.00 | 42.00 | 36.00 | 85.00 | 7.00 | 12.00 | 15.00 |
| 21 | 1.00 | 55.00 | 25.00 | 42.00 | 12.00 | 9.00 | 38.00 | 8.00 | 9.00 | 21.00 | ... | 5.00 | 7.00 | 18.00 | 16.00 | 33.00 | 90.00 | 44.00 | 22.00 | 16.00 | 17.00 |
| 22 | 33.00 | 10.00 | 0.00 | 42.00 | 38.00 | 0.00 | 20.00 | 19.00 | 7.00 | 12.00 | ... | 10.00 | 7.00 | 33.00 | 10.00 | 20.00 | 15.00 | 5.00 | 49.00 | 2.00 | 21.00 |
| 23 | 0.00 | 0.00 | 7.00 | 0.00 | 0.00 | 26.00 | 0.00 | 0.00 | 16.00 | 1.00 | ... | 6.00 | 2.00 | 6.00 | 7.00 | 6.00 | 0.00 | 0.00 | 6.00 | 5.00 | 1.00 |
24 rows × 152 columns
plt.figure(figsize = (20,14))
sns.heatmap(purchases_day_hour);
Create a df with the variables of interest per customer: total number of products, total number of purchases, average price per product, and date of last purchase.
customers = df.loc[df.event_type == 'purchase'].groupby(['user_id']).agg({'product_id':'count',
'user_session':'nunique',
'price': 'mean',
'date': 'max'})
customers
| product_id | user_session | price | date | |
|---|---|---|---|---|
| user_id | ||||
| 25392526 | 3 | 1 | 7.38 | 2019-12-18 |
| 27756757 | 1 | 1 | 20.63 | 2020-01-27 |
| 50748978 | 9 | 1 | 1.11 | 2019-12-14 |
| 52747911 | 3 | 1 | 7.67 | 2019-10-10 |
| 65241811 | 5 | 1 | 8.36 | 2019-11-11 |
| ... | ... | ... | ... | ... |
| 621995551 | 5 | 1 | 2.09 | 2020-02-29 |
| 622021687 | 1 | 1 | 13.33 | 2020-02-29 |
| 622041514 | 3 | 1 | 0.63 | 2020-02-29 |
| 622042698 | 3 | 1 | 28.04 | 2020-02-29 |
| 622065819 | 4 | 1 | 5.12 | 2020-02-29 |
11040 rows × 4 columns
Rename variables.
customers.columns = ['products_tot_num','purchases_tot_num','price_avg_prod','last_purchase']
customers
| products_tot_num | purchases_tot_num | price_avg_prod | last_purchase | |
|---|---|---|---|---|
| user_id | ||||
| 25392526 | 3 | 1 | 7.38 | 2019-12-18 |
| 27756757 | 1 | 1 | 20.63 | 2020-01-27 |
| 50748978 | 9 | 1 | 1.11 | 2019-12-14 |
| 52747911 | 3 | 1 | 7.67 | 2019-10-10 |
| 65241811 | 5 | 1 | 8.36 | 2019-11-11 |
| ... | ... | ... | ... | ... |
| 621995551 | 5 | 1 | 2.09 | 2020-02-29 |
| 622021687 | 1 | 1 | 13.33 | 2020-02-29 |
| 622041514 | 3 | 1 | 0.63 | 2020-02-29 |
| 622042698 | 3 | 1 | 28.04 | 2020-02-29 |
| 622065819 | 4 | 1 | 5.12 | 2020-02-29 |
11040 rows × 4 columns
Calculate additional variables: the customer's total spend, and the products per purchase.
customers['tot_expense'] = customers.products_tot_num * customers.price_avg_prod
customers['products_purchase'] = customers.products_tot_num / customers.purchases_tot_num
customers
| products_tot_num | purchases_tot_num | price_avg_prod | last_purchase | tot_expense | products_purchase | |
|---|---|---|---|---|---|---|
| user_id | ||||||
| 25392526 | 3 | 1 | 7.38 | 2019-12-18 | 22.14 | 3.00 |
| 27756757 | 1 | 1 | 20.63 | 2020-01-27 | 20.63 | 1.00 |
| 50748978 | 9 | 1 | 1.11 | 2019-12-14 | 10.01 | 9.00 |
| 52747911 | 3 | 1 | 7.67 | 2019-10-10 | 23.02 | 3.00 |
| 65241811 | 5 | 1 | 8.36 | 2019-11-11 | 41.79 | 5.00 |
| ... | ... | ... | ... | ... | ... | ... |
| 621995551 | 5 | 1 | 2.09 | 2020-02-29 | 10.46 | 5.00 |
| 622021687 | 1 | 1 | 13.33 | 2020-02-29 | 13.33 | 1.00 |
| 622041514 | 3 | 1 | 0.63 | 2020-02-29 | 1.90 | 3.00 |
| 622042698 | 3 | 1 | 28.04 | 2020-02-29 | 84.13 | 3.00 |
| 622065819 | 4 | 1 | 5.12 | 2020-02-29 | 20.48 | 4.00 |
11040 rows × 6 columns
For each of the customers we have:
sns.histplot(data = customers, x = 'tot_expense', bins = 100)
plt.xlim([0,300]);
The vast majority of customers have spent less than €50.
sns.countplot(data = customers, x = 'purchases_tot_num');
The vast majority of customers make only one purchase.
customers.products_purchase.describe()
count 11040.00 mean 7.79 std 9.49 min 1.00 25% 3.00 50% 5.00 75% 10.00 max 219.00 Name: products_purchase, dtype: float64
The maximum is quite high, probably from a customer who resells the products. To reduce the influence of this outlier, we analyze the median instead of the mean.
The median purchase includes 5 products.
But 25% of customers buy more than 10 products per purchase. There is a lot of room for improvement in this ratio, for example through recommender systems at the time of purchase.
We find the top 10 customers and compare them with the average expenditure per customer.
customers.nlargest(n = 10, columns = 'tot_expense')
| products_tot_num | purchases_tot_num | price_avg_prod | last_purchase | tot_expense | products_purchase | |
|---|---|---|---|---|---|---|
| user_id | ||||||
| 573823111 | 268 | 2 | 5.82 | 2020-02-21 | 1559.21 | 134.00 |
| 539751397 | 236 | 13 | 6.16 | 2020-02-19 | 1453.37 | 18.15 |
| 556579890 | 506 | 4 | 2.75 | 2020-02-27 | 1392.45 | 126.50 |
| 442763940 | 195 | 8 | 6.37 | 2019-12-23 | 1241.53 | 24.38 |
| 561592095 | 94 | 3 | 11.81 | 2019-10-31 | 1109.70 | 31.33 |
| 527739278 | 244 | 13 | 4.39 | 2020-02-16 | 1071.00 | 18.77 |
| 527806771 | 195 | 13 | 4.86 | 2020-02-20 | 948.01 | 15.00 |
| 430220205 | 190 | 6 | 4.99 | 2020-02-29 | 947.30 | 31.67 |
| 491009486 | 219 | 1 | 4.32 | 2020-02-12 | 946.20 | 219.00 |
| 520501669 | 64 | 11 | 14.27 | 2020-01-17 | 913.01 | 5.82 |
Calculate the average total expenditure per customer.
customers.tot_expense.mean()
56.299782608695594
customers.nlargest(n = 10, columns = 'tot_expense').tot_expense/customers.tot_expense.mean()
user_id 573823111 27.69 539751397 25.81 556579890 24.73 442763940 22.05 561592095 19.71 527739278 19.02 527806771 16.84 430220205 16.83 491009486 16.81 520501669 16.22 Name: tot_expense, dtype: float64
There are customers with average spending dozens of times higher than the average.
These customers need to be retained through loyalty programs.
Since we only have 5 months of history, our cohort analysis yields 3 cohorts.
Create a df with just the users who purchased a product, and the month.
c = df.loc[df.event_type == 'purchase', ['user_id','month']]
c
| user_id | month | |
|---|---|---|
| event_time | ||
| 2019-10-01 00:26:49+00:00 | 536128518 | 10 |
| 2019-10-01 00:26:49+00:00 | 536128518 | 10 |
| 2019-10-01 00:26:49+00:00 | 536128518 | 10 |
| 2019-10-01 00:26:49+00:00 | 536128518 | 10 |
| 2019-10-01 00:26:49+00:00 | 536128518 | 10 |
| ... | ... | ... |
| 2020-02-29 22:29:19+00:00 | 622065819 | 2 |
| 2020-02-29 22:29:19+00:00 | 622065819 | 2 |
| 2020-02-29 22:29:19+00:00 | 622065819 | 2 |
| 2020-02-29 22:29:19+00:00 | 622065819 | 2 |
| 2020-02-29 23:26:42+00:00 | 610361057 | 2 |
127564 rows × 2 columns
Move the months to the columns.
c = pd.crosstab(c.user_id,c.month).reset_index()
c
| month | user_id | 1 | 2 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|
| 0 | 25392526 | 0 | 0 | 0 | 0 | 3 |
| 1 | 27756757 | 1 | 0 | 0 | 0 | 0 |
| 2 | 50748978 | 0 | 0 | 0 | 0 | 9 |
| 3 | 52747911 | 0 | 0 | 3 | 0 | 0 |
| 4 | 65241811 | 0 | 0 | 0 | 5 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 11035 | 621995551 | 0 | 5 | 0 | 0 | 0 |
| 11036 | 622021687 | 0 | 1 | 0 | 0 | 0 |
| 11037 | 622041514 | 0 | 3 | 0 | 0 | 0 |
| 11038 | 622042698 | 0 | 3 | 0 | 0 | 0 |
| 11039 | 622065819 | 0 | 4 | 0 | 0 | 0 |
11040 rows × 6 columns
Rename the columns: the period starts in October, so month 10 is cohort 1. Then drop user_id, which we no longer need.
c.columns = ['user_id','c4','c5','c1','c2','c3']
c.drop(columns = 'user_id', inplace = True)
c
| c4 | c5 | c1 | c2 | c3 | |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 3 |
| 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 9 |
| 3 | 0 | 0 | 3 | 0 | 0 |
| 4 | 0 | 0 | 0 | 5 | 0 |
| ... | ... | ... | ... | ... | ... |
| 11035 | 0 | 5 | 0 | 0 | 0 |
| 11036 | 0 | 1 | 0 | 0 | 0 |
| 11037 | 0 | 3 | 0 | 0 | 0 |
| 11038 | 0 | 3 | 0 | 0 | 0 |
| 11039 | 0 | 4 | 0 | 0 | 0 |
11040 rows × 5 columns
The first cohort we can analyze is that of month 2 (cohort c2, November). We select "new" customers: those who did not purchase in the previous month.
c2 = c.loc[(c.c1 == 0) & (c.c2 > 0)]
c2
| c4 | c5 | c1 | c2 | c3 | |
|---|---|---|---|---|---|
| 4 | 0 | 0 | 0 | 5 | 0 |
| 6 | 0 | 0 | 0 | 10 | 0 |
| 8 | 0 | 0 | 0 | 27 | 17 |
| 9 | 0 | 0 | 0 | 3 | 0 |
| 13 | 0 | 0 | 0 | 4 | 0 |
| ... | ... | ... | ... | ... | ... |
| 7702 | 0 | 0 | 0 | 5 | 0 |
| 7703 | 0 | 5 | 0 | 2 | 0 |
| 7705 | 0 | 0 | 0 | 1 | 0 |
| 7708 | 0 | 0 | 0 | 5 | 6 |
| 7709 | 0 | 0 | 0 | 1 | 0 |
2640 rows × 5 columns
We convert it to a binary dataframe (we only care whether each customer purchased in a given month, not how many times).
def binarize(variable):
variable = variable.transform(lambda x: 1 if (x > 0) else 0)
return(variable)
c2_b = c2.apply(binarize)
c2_b
| c4 | c5 | c1 | c2 | c3 | |
|---|---|---|---|---|---|
| 4 | 0 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 0 | 1 | 0 |
| 8 | 0 | 0 | 0 | 1 | 1 |
| 9 | 0 | 0 | 0 | 1 | 0 |
| 13 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... |
| 7702 | 0 | 0 | 0 | 1 | 0 |
| 7703 | 0 | 1 | 0 | 1 | 0 |
| 7705 | 0 | 0 | 0 | 1 | 0 |
| 7708 | 0 | 0 | 0 | 1 | 1 |
| 7709 | 0 | 0 | 0 | 1 | 0 |
2640 rows × 5 columns
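As a side note, the element-wise `binarize` above can also be written as a single vectorized comparison; a sketch on a toy frame:

```python
import pandas as pd

# Toy cohort frame: purchase counts per month
c2 = pd.DataFrame({'c1': [0, 0, 0], 'c2': [5, 27, 1], 'c3': [0, 17, 0]})

# Equivalent to applying binarize(): 1 where the count is positive, else 0
c2_b = (c2 > 0).astype(int)
```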
Compute the percentage of customers in this cohort who continued to buy in the following months.
c2_f=c2_b.mean().sort_index()
c2_f
c1 0.00 c2 1.00 c3 0.10 c4 0.10 c5 0.08 dtype: float64
Cohort 3
c3 = c.loc[(c.c2 == 0) & (c.c3 > 0)]
c3_b = c3.apply(binarize)
c3_f = c3_b.mean().sort_index()
c3_f['c1'] = 0
c3_f
c1 0.00 c2 0.00 c3 1.00 c4 0.10 c5 0.08 dtype: float64
Cohort 4
c4 = c.loc[(c.c3 == 0) & (c.c4 > 0)]
c4_b = c4.apply(binarize)
c4_f = c4_b.mean().sort_index()
c4_f['c1'] = 0
c4_f['c2'] = 0
c4_f
c1 0.00 c2 0.00 c3 0.00 c4 1.00 c5 0.12 dtype: float64
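The three cohort computations above follow the same pattern, so they could be factored into a helper. A minimal sketch on a toy frame (`cohort_retention` is a hypothetical name, not part of the notebook):

```python
import pandas as pd

def cohort_retention(c, prev_month, new_month):
    """Share of the cohort that is new in `new_month` (no purchases in
    `prev_month`) that buys in each month of the frame."""
    cohort = c.loc[(c[prev_month] == 0) & (c[new_month] > 0)]
    return (cohort > 0).mean().sort_index()

# Toy purchase-count frame, one customer per row
c = pd.DataFrame({'c1': [1, 0, 0], 'c2': [0, 2, 0], 'c3': [0, 1, 3]})
r = cohort_retention(c, 'c1', 'c2')  # cohort new in c2
```

Here `r['c2']` is 1.0 by construction (everyone in the cohort bought in c2), and later months give the retention rates.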
Create a df with all cohorts.
cohorts = pd.DataFrame({'c2':c2_f,'c3':c3_f,'c4':c4_f})
cohorts
| c2 | c3 | c4 | |
|---|---|---|---|
| c1 | 0.00 | 0.00 | 0.00 |
| c2 | 1.00 | 0.00 | 0.00 |
| c3 | 0.10 | 1.00 | 0.00 |
| c4 | 0.10 | 0.10 | 1.00 |
| c5 | 0.08 | 0.08 | 0.12 |
cohorts = cohorts.drop(index = 'c1').T
cohorts
| c2 | c3 | c4 | c5 | |
|---|---|---|---|---|
| c2 | 1.00 | 0.10 | 0.10 | 0.08 |
| c3 | 0.00 | 1.00 | 0.10 | 0.08 |
| c4 | 0.00 | 0.00 | 1.00 | 0.12 |
plt.figure(figsize = (12,6))
sns.heatmap(cohorts,annot = True, fmt = '.0%', cmap='Greys');
Conclusion:
90% of new customers do not buy again in the following months.
We build the RFM matrix (Recency - Frequency - Monetary).
The variables for each dimension are: recency from the date of the last purchase, frequency from purchases_tot_num, and monetary from tot_expense.
We discretize each variable into 5 groups.
customers['F'] = customers.purchases_tot_num.transform(lambda x: pd.cut(x,5, labels = False)) + 1
customers
| products_tot_num | purchases_tot_num | price_avg_prod | last_purchase | tot_expense | products_purchase | F | |
|---|---|---|---|---|---|---|---|
| user_id | |||||||
| 25392526 | 3 | 1 | 7.38 | 2019-12-18 | 22.14 | 3.00 | 1 |
| 27756757 | 1 | 1 | 20.63 | 2020-01-27 | 20.63 | 1.00 | 1 |
| 50748978 | 9 | 1 | 1.11 | 2019-12-14 | 10.01 | 9.00 | 1 |
| 52747911 | 3 | 1 | 7.67 | 2019-10-10 | 23.02 | 3.00 | 1 |
| 65241811 | 5 | 1 | 8.36 | 2019-11-11 | 41.79 | 5.00 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 621995551 | 5 | 1 | 2.09 | 2020-02-29 | 10.46 | 5.00 | 1 |
| 622021687 | 1 | 1 | 13.33 | 2020-02-29 | 13.33 | 1.00 | 1 |
| 622041514 | 3 | 1 | 0.63 | 2020-02-29 | 1.90 | 3.00 | 1 |
| 622042698 | 3 | 1 | 28.04 | 2020-02-29 | 84.13 | 3.00 | 1 |
| 622065819 | 4 | 1 | 5.12 | 2020-02-29 | 20.48 | 4.00 | 1 |
11040 rows × 7 columns
Check
customers.groupby('F').purchases_tot_num.mean()
F 1 1.31 2 7.06 3 12.00 4 16.50 5 23.50 Name: purchases_tot_num, dtype: float64
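Note that `pd.cut` creates equal-width bins, so with this skewed distribution almost all customers land in group 1. A quantile-based alternative (not used in this notebook) is `pd.qcut`, which produces groups of roughly equal size; a sketch on a toy series:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
equal_width = pd.cut(s, 5, labels=False) + 1                      # bins span equal value ranges
equal_size = pd.qcut(s, 5, labels=False, duplicates='drop') + 1   # bins hold ~equal counts
```

With `pd.cut` most of the skewed values fall in the first bin, while `pd.qcut` spreads them evenly across the five groups.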
Create monetary
customers['M'] = customers.tot_expense.transform(lambda x: pd.cut(x,5, labels = False)) + 1
customers.groupby('M').tot_expense.mean()
M 1 48.36 2 410.98 3 765.18 4 1043.96 5 1468.34 Name: tot_expense, dtype: float64
Create recency. It should be a number, not a date, so we define it as the distance in days from the most recent date in the data.
most_recent = customers.last_purchase.max()
customers['last_purchase_days'] = customers.last_purchase.transform(lambda x: most_recent - x)
customers
| products_tot_num | purchases_tot_num | price_avg_prod | last_purchase | tot_expense | products_purchase | F | M | last_purchase_days | |
|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||
| 25392526 | 3 | 1 | 7.38 | 2019-12-18 | 22.14 | 3.00 | 1 | 1 | 73 days |
| 27756757 | 1 | 1 | 20.63 | 2020-01-27 | 20.63 | 1.00 | 1 | 1 | 33 days |
| 50748978 | 9 | 1 | 1.11 | 2019-12-14 | 10.01 | 9.00 | 1 | 1 | 77 days |
| 52747911 | 3 | 1 | 7.67 | 2019-10-10 | 23.02 | 3.00 | 1 | 1 | 142 days |
| 65241811 | 5 | 1 | 8.36 | 2019-11-11 | 41.79 | 5.00 | 1 | 1 | 110 days |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 621995551 | 5 | 1 | 2.09 | 2020-02-29 | 10.46 | 5.00 | 1 | 1 | 0 days |
| 622021687 | 1 | 1 | 13.33 | 2020-02-29 | 13.33 | 1.00 | 1 | 1 | 0 days |
| 622041514 | 3 | 1 | 0.63 | 2020-02-29 | 1.90 | 3.00 | 1 | 1 | 0 days |
| 622042698 | 3 | 1 | 28.04 | 2020-02-29 | 84.13 | 3.00 | 1 | 1 | 0 days |
| 622065819 | 4 | 1 | 5.12 | 2020-02-29 | 20.48 | 4.00 | 1 | 1 | 0 days |
11040 rows × 9 columns
customers['last_purchase_days'] = customers.last_purchase_days.dt.days
customers
| products_tot_num | purchases_tot_num | price_avg_prod | last_purchase | tot_expense | products_purchase | F | M | last_purchase_days | |
|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||
| 25392526 | 3 | 1 | 7.38 | 2019-12-18 | 22.14 | 3.00 | 1 | 1 | 73 |
| 27756757 | 1 | 1 | 20.63 | 2020-01-27 | 20.63 | 1.00 | 1 | 1 | 33 |
| 50748978 | 9 | 1 | 1.11 | 2019-12-14 | 10.01 | 9.00 | 1 | 1 | 77 |
| 52747911 | 3 | 1 | 7.67 | 2019-10-10 | 23.02 | 3.00 | 1 | 1 | 142 |
| 65241811 | 5 | 1 | 8.36 | 2019-11-11 | 41.79 | 5.00 | 1 | 1 | 110 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 621995551 | 5 | 1 | 2.09 | 2020-02-29 | 10.46 | 5.00 | 1 | 1 | 0 |
| 622021687 | 1 | 1 | 13.33 | 2020-02-29 | 13.33 | 1.00 | 1 | 1 | 0 |
| 622041514 | 3 | 1 | 0.63 | 2020-02-29 | 1.90 | 3.00 | 1 | 1 | 0 |
| 622042698 | 3 | 1 | 28.04 | 2020-02-29 | 84.13 | 3.00 | 1 | 1 | 0 |
| 622065819 | 4 | 1 | 5.12 | 2020-02-29 | 20.48 | 4.00 | 1 | 1 | 0 |
11040 rows × 9 columns
We can now create R, but keep in mind that here the lowest values are the best, so the scale must be inverted to be consistent with F and M.
customers['R'] = customers.last_purchase_days.transform(lambda x: pd.cut(x,5, labels = False)) + 1
customers.groupby('R').last_purchase_days.mean()
R 1 14.62 2 43.04 3 75.94 4 103.85 5 135.91 Name: last_purchase_days, dtype: float64
customers['R'] = 6 - customers.R
customers.groupby('R').last_purchase_days.mean()
R 1 135.91 2 103.85 3 75.94 4 43.04 5 14.62 Name: last_purchase_days, dtype: float64
Create additional variables that unify R, F and M:
customers['value'] = customers.R + customers.F + customers.M
customers['RFM'] = customers.apply(lambda x: str(x.R) + str(x.F) + str(x.M), axis = 1)
customers
| products_tot_num | purchases_tot_num | price_avg_prod | last_purchase | tot_expense | products_purchase | F | M | last_purchase_days | R | value | RFM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | ||||||||||||
| 25392526 | 3 | 1 | 7.38 | 2019-12-18 | 22.14 | 3.00 | 1 | 1 | 73 | 3 | 5 | 311 |
| 27756757 | 1 | 1 | 20.63 | 2020-01-27 | 20.63 | 1.00 | 1 | 1 | 33 | 4 | 6 | 411 |
| 50748978 | 9 | 1 | 1.11 | 2019-12-14 | 10.01 | 9.00 | 1 | 1 | 77 | 3 | 5 | 311 |
| 52747911 | 3 | 1 | 7.67 | 2019-10-10 | 23.02 | 3.00 | 1 | 1 | 142 | 1 | 3 | 111 |
| 65241811 | 5 | 1 | 8.36 | 2019-11-11 | 41.79 | 5.00 | 1 | 1 | 110 | 2 | 4 | 211 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 621995551 | 5 | 1 | 2.09 | 2020-02-29 | 10.46 | 5.00 | 1 | 1 | 0 | 5 | 7 | 511 |
| 622021687 | 1 | 1 | 13.33 | 2020-02-29 | 13.33 | 1.00 | 1 | 1 | 0 | 5 | 7 | 511 |
| 622041514 | 3 | 1 | 0.63 | 2020-02-29 | 1.90 | 3.00 | 1 | 1 | 0 | 5 | 7 | 511 |
| 622042698 | 3 | 1 | 28.04 | 2020-02-29 | 84.13 | 3.00 | 1 | 1 | 0 | 5 | 7 | 511 |
| 622065819 | 4 | 1 | 5.12 | 2020-02-29 | 20.48 | 4.00 | 1 | 1 | 0 | 5 | 7 | 511 |
11040 rows × 12 columns
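As a side note, the row-wise `apply` used to build the RFM code can be replaced by vectorized string concatenation; a sketch on a toy frame:

```python
import pandas as pd

customers = pd.DataFrame({'R': [3, 5], 'F': [1, 2], 'M': [1, 4]})

# Cast each score to string and concatenate column-wise, no apply needed
customers['RFM'] = (customers.R.astype(str)
                    + customers.F.astype(str)
                    + customers.M.astype(str))
```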
On this dataframe we build a cube for the analysis.
metrics = ['products_tot_num','purchases_tot_num','tot_expense']
dimensions = ['R','F','M','RFM','value']
cube = customers[dimensions + metrics]
cube
| R | F | M | RFM | value | products_tot_num | purchases_tot_num | tot_expense | |
|---|---|---|---|---|---|---|---|---|
| user_id | ||||||||
| 25392526 | 3 | 1 | 1 | 311 | 5 | 3 | 1 | 22.14 |
| 27756757 | 4 | 1 | 1 | 411 | 6 | 1 | 1 | 20.63 |
| 50748978 | 3 | 1 | 1 | 311 | 5 | 9 | 1 | 10.01 |
| 52747911 | 1 | 1 | 1 | 111 | 3 | 3 | 1 | 23.02 |
| 65241811 | 2 | 1 | 1 | 211 | 4 | 5 | 1 | 41.79 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 621995551 | 5 | 1 | 1 | 511 | 7 | 5 | 1 | 10.46 |
| 622021687 | 5 | 1 | 1 | 511 | 7 | 1 | 1 | 13.33 |
| 622041514 | 5 | 1 | 1 | 511 | 7 | 3 | 1 | 1.90 |
| 622042698 | 5 | 1 | 1 | 511 | 7 | 3 | 1 | 84.13 |
| 622065819 | 5 | 1 | 1 | 511 | 7 | 4 | 1 | 20.48 |
11040 rows × 8 columns
cube = cube.melt(id_vars = metrics)
cube
| products_tot_num | purchases_tot_num | tot_expense | variable | value | |
|---|---|---|---|---|---|
| 0 | 3 | 1 | 22.14 | R | 3 |
| 1 | 1 | 1 | 20.63 | R | 4 |
| 2 | 9 | 1 | 10.01 | R | 3 |
| 3 | 3 | 1 | 23.02 | R | 1 |
| 4 | 5 | 1 | 41.79 | R | 2 |
| ... | ... | ... | ... | ... | ... |
| 55195 | 5 | 1 | 10.46 | value | 7 |
| 55196 | 1 | 1 | 13.33 | value | 7 |
| 55197 | 3 | 1 | 1.90 | value | 7 |
| 55198 | 3 | 1 | 84.13 | value | 7 |
| 55199 | 4 | 1 | 20.48 | value | 7 |
55200 rows × 5 columns
cube = cube.groupby(['variable','value'], as_index = False)[metrics].mean()
cube
| variable | value | products_tot_num | purchases_tot_num | tot_expense | |
|---|---|---|---|---|---|
| 0 | F | 1 | 10.61 | 1.31 | 52.09 |
| 1 | F | 2 | 71.42 | 7.06 | 320.47 |
| 2 | F | 3 | 123.64 | 12.00 | 643.20 |
| 3 | F | 4 | 156.75 | 16.50 | 560.15 |
| 4 | F | 5 | 124.00 | 23.50 | 652.42 |
| ... | ... | ... | ... | ... | ... |
| 58 | value | 9 | 98.02 | 7.25 | 491.71 |
| 59 | value | 10 | 140.89 | 10.22 | 625.93 |
| 60 | value | 11 | 291.00 | 5.75 | 1189.31 |
| 61 | value | 12 | 189.80 | 16.60 | 833.43 |
| 62 | value | 13 | 179.00 | 18.00 | 1136.70 |
63 rows × 5 columns
We analyze each dimension of the cube and visualize it.
cube[cube.variable == 'F']
| variable | value | products_tot_num | purchases_tot_num | tot_expense | |
|---|---|---|---|---|---|
| 0 | F | 1 | 10.61 | 1.31 | 52.09 |
| 1 | F | 2 | 71.42 | 7.06 | 320.47 |
| 2 | F | 3 | 123.64 | 12.00 | 643.20 |
| 3 | F | 4 | 156.75 | 16.50 | 560.15 |
| 4 | F | 5 | 124.00 | 23.50 | 652.42 |
Average number of products, total number of purchases, and total expenditure for each customer group.
cube[cube.variable == 'F'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout(); # avoid overlap in labeling
cube[cube.variable == 'R']
| variable | value | products_tot_num | purchases_tot_num | tot_expense | |
|---|---|---|---|---|---|
| 10 | R | 1 | 8.15 | 1.08 | 41.56 |
| 11 | R | 2 | 9.25 | 1.18 | 45.58 |
| 12 | R | 3 | 9.54 | 1.29 | 47.25 |
| 13 | R | 4 | 11.72 | 1.44 | 58.19 |
| 14 | R | 5 | 16.83 | 1.82 | 79.04 |
cube[cube.variable == 'R'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();
cube[cube.variable == 'M']
| variable | value | products_tot_num | purchases_tot_num | tot_expense | |
|---|---|---|---|---|---|
| 5 | M | 1 | 10.12 | 1.34 | 48.36 |
| 6 | M | 2 | 74.28 | 4.31 | 410.98 |
| 7 | M | 3 | 138.50 | 6.86 | 765.18 |
| 8 | M | 4 | 189.50 | 7.33 | 1043.96 |
| 9 | M | 5 | 336.67 | 6.33 | 1468.34 |
cube[cube.variable == 'M'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();
cube[cube.variable == 'RFM']
| variable | value | products_tot_num | purchases_tot_num | tot_expense | |
|---|---|---|---|---|---|
| 15 | RFM | 111 | 7.97 | 1.08 | 39.28 |
| 16 | RFM | 112 | 37.38 | 1.50 | 397.98 |
| 17 | RFM | 114 | 94.00 | 3.00 | 1109.70 |
| 18 | RFM | 211 | 8.94 | 1.17 | 43.16 |
| 19 | RFM | 212 | 47.00 | 1.73 | 391.14 |
| 20 | RFM | 213 | 35.50 | 1.00 | 662.84 |
| 21 | RFM | 221 | 85.67 | 6.33 | 194.48 |
| 22 | RFM | 311 | 8.98 | 1.26 | 43.91 |
| 23 | RFM | 312 | 53.46 | 2.46 | 376.38 |
| 24 | RFM | 321 | 39.00 | 6.50 | 141.28 |
| 25 | RFM | 322 | 121.00 | 6.50 | 389.19 |
| 26 | RFM | 324 | 195.00 | 8.00 | 1241.53 |
| 27 | RFM | 411 | 10.37 | 1.36 | 49.99 |
| 28 | RFM | 412 | 60.69 | 2.54 | 404.22 |
| 29 | RFM | 413 | 130.50 | 2.50 | 868.30 |
| 30 | RFM | 421 | 41.00 | 6.92 | 205.31 |
| 31 | RFM | 422 | 92.88 | 7.12 | 440.47 |
| 32 | RFM | 423 | 153.33 | 7.33 | 672.97 |
| 33 | RFM | 433 | 64.00 | 11.00 | 913.01 |
| 34 | RFM | 511 | 11.94 | 1.50 | 56.07 |
| 35 | RFM | 512 | 78.79 | 3.05 | 399.18 |
| 36 | RFM | 513 | 133.00 | 3.14 | 742.48 |
| 37 | RFM | 514 | 219.00 | 1.00 | 946.20 |
| 38 | RFM | 515 | 387.00 | 3.00 | 1475.83 |
| 39 | RFM | 521 | 47.88 | 6.83 | 192.19 |
| 40 | RFM | 522 | 89.51 | 7.51 | 442.50 |
| 41 | RFM | 523 | 184.67 | 7.33 | 766.90 |
| 42 | RFM | 524 | 190.00 | 6.00 | 947.30 |
| 43 | RFM | 531 | 58.25 | 11.50 | 233.84 |
| 44 | RFM | 532 | 94.00 | 12.50 | 448.09 |
| 45 | RFM | 533 | 200.00 | 11.00 | 858.26 |
| 46 | RFM | 534 | 219.50 | 13.00 | 1009.51 |
| 47 | RFM | 535 | 236.00 | 13.00 | 1453.37 |
| 48 | RFM | 541 | 121.50 | 16.50 | 288.65 |
| 49 | RFM | 543 | 192.00 | 16.50 | 831.65 |
| 50 | RFM | 552 | 126.00 | 24.00 | 484.81 |
| 51 | RFM | 553 | 122.00 | 23.00 | 820.04 |
cube[cube.variable == 'RFM'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout(); # avoid overlap in labeling
cube[cube.variable == 'value']
| variable | value | products_tot_num | purchases_tot_num | tot_expense | |
|---|---|---|---|---|---|
| 52 | value | 3 | 7.97 | 1.08 | 39.28 |
| 53 | value | 4 | 9.04 | 1.17 | 44.41 |
| 54 | value | 5 | 9.31 | 1.27 | 46.08 |
| 55 | value | 6 | 10.73 | 1.38 | 53.07 |
| 56 | value | 7 | 12.64 | 1.54 | 60.42 |
| 57 | value | 8 | 66.13 | 5.06 | 313.01 |
| 58 | value | 9 | 98.02 | 7.25 | 491.71 |
| 59 | value | 10 | 140.89 | 10.22 | 625.93 |
| 60 | value | 11 | 291.00 | 5.75 | 1189.31 |
| 61 | value | 12 | 189.80 | 16.60 | 833.43 |
| 62 | value | 13 | 179.00 | 18.00 | 1136.70 |
cube[cube.variable == 'value'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();
The result identifies the best customers to target when launching loyalty campaigns.
Create a df at the product level with the event counts and the price of each product.
prod = df.groupby(['product_id','event_type']).size()
prod
product_id event_type
3752 view 10
3762 cart 127
purchase 28
remove_from_cart 59
view 258
...
5932538 view 1
5932540 cart 1
view 2
5932578 view 1
5932585 view 2
Length: 137068, dtype: int64
prod = prod.unstack(level = 1).fillna(0)
prod
| event_type | cart | purchase | remove_from_cart | view |
|---|---|---|---|---|
| product_id | ||||
| 3752 | 0.00 | 0.00 | 0.00 | 10.00 |
| 3762 | 127.00 | 28.00 | 59.00 | 258.00 |
| 3763 | 10.00 | 2.00 | 2.00 | 51.00 |
| 3771 | 0.00 | 0.00 | 0.00 | 9.00 |
| 3774 | 26.00 | 7.00 | 13.00 | 76.00 |
| ... | ... | ... | ... | ... |
| 5932537 | 1.00 | 0.00 | 0.00 | 1.00 |
| 5932538 | 0.00 | 0.00 | 0.00 | 1.00 |
| 5932540 | 1.00 | 0.00 | 0.00 | 2.00 |
| 5932578 | 0.00 | 0.00 | 0.00 | 1.00 |
| 5932585 | 0.00 | 0.00 | 0.00 | 2.00 |
45327 rows × 4 columns
We include the average price of each product, since prices might have changed over time.
master_prices = df.groupby('product_id', as_index = False).price.mean()
master_prices
| product_id | price | |
|---|---|---|
| 0 | 3752 | 15.71 |
| 1 | 3762 | 19.29 |
| 2 | 3763 | 16.00 |
| 3 | 3771 | 15.08 |
| 4 | 3774 | 15.92 |
| ... | ... | ... |
| 45322 | 5932537 | 1.43 |
| 45323 | 5932538 | 1.43 |
| 45324 | 5932540 | 1.43 |
| 45325 | 5932578 | 6.02 |
| 45326 | 5932585 | 6.33 |
45327 rows × 2 columns
We merge both dataframes.
prod = pd.merge(left = prod, right = master_prices, how = 'left', on = 'product_id')
prod
| product_id | cart | purchase | remove_from_cart | view | price | |
|---|---|---|---|---|---|---|
| 0 | 3752 | 0.00 | 0.00 | 0.00 | 10.00 | 15.71 |
| 1 | 3762 | 127.00 | 28.00 | 59.00 | 258.00 | 19.29 |
| 2 | 3763 | 10.00 | 2.00 | 2.00 | 51.00 | 16.00 |
| 3 | 3771 | 0.00 | 0.00 | 0.00 | 9.00 | 15.08 |
| 4 | 3774 | 26.00 | 7.00 | 13.00 | 76.00 | 15.92 |
| ... | ... | ... | ... | ... | ... | ... |
| 45322 | 5932537 | 1.00 | 0.00 | 0.00 | 1.00 | 1.43 |
| 45323 | 5932538 | 0.00 | 0.00 | 0.00 | 1.00 | 1.43 |
| 45324 | 5932540 | 1.00 | 0.00 | 0.00 | 2.00 | 1.43 |
| 45325 | 5932578 | 0.00 | 0.00 | 0.00 | 1.00 | 6.02 |
| 45326 | 5932585 | 0.00 | 0.00 | 0.00 | 2.00 | 6.33 |
45327 rows × 6 columns
prod.sort_values('purchase',ascending = False)[0:10]
| product_id | cart | purchase | remove_from_cart | view | price | |
|---|---|---|---|---|---|---|
| 16807 | 5809910 | 2796.00 | 764.00 | 1249.00 | 9195.00 | 5.21 |
| 28178 | 5854897 | 2486.00 | 483.00 | 793.00 | 624.00 | 0.32 |
| 6644 | 5700037 | 2603.00 | 361.00 | 716.00 | 1150.00 | 0.40 |
| 314 | 5304 | 1184.00 | 341.00 | 426.00 | 516.00 | 0.32 |
| 9900 | 5751422 | 1119.00 | 331.00 | 625.00 | 2204.00 | 10.87 |
| 15394 | 5802432 | 2495.00 | 322.00 | 745.00 | 701.00 | 0.32 |
| 16809 | 5809912 | 1352.00 | 321.00 | 863.00 | 3059.00 | 5.19 |
| 18415 | 5815662 | 1697.00 | 310.00 | 653.00 | 1219.00 | 0.91 |
| 9862 | 5751383 | 1035.00 | 298.00 | 550.00 | 2341.00 | 10.24 |
| 14043 | 5792800 | 911.00 | 285.00 | 512.00 | 1527.00 | 10.25 |
prod[prod.purchase == 0]
| product_id | cart | purchase | remove_from_cart | view | price | |
|---|---|---|---|---|---|---|
| 0 | 3752 | 0.00 | 0.00 | 0.00 | 10.00 | 15.71 |
| 3 | 3771 | 0.00 | 0.00 | 0.00 | 9.00 | 15.08 |
| 6 | 3790 | 0.00 | 0.00 | 0.00 | 10.00 | 7.92 |
| 8 | 3809 | 0.00 | 0.00 | 0.00 | 2.00 | 12.54 |
| 9 | 3812 | 0.00 | 0.00 | 0.00 | 1.00 | 12.54 |
| ... | ... | ... | ... | ... | ... | ... |
| 45322 | 5932537 | 1.00 | 0.00 | 0.00 | 1.00 | 1.43 |
| 45323 | 5932538 | 0.00 | 0.00 | 0.00 | 1.00 | 1.43 |
| 45324 | 5932540 | 1.00 | 0.00 | 0.00 | 2.00 | 1.43 |
| 45325 | 5932578 | 0.00 | 0.00 | 0.00 | 1.00 | 6.02 |
| 45326 | 5932585 | 0.00 | 0.00 | 0.00 | 2.00 | 6.33 |
21850 rows × 6 columns
Conclusion:
Almost half of the products have had no sales in the 5 months of data history.
Since this analysis focuses on sales, we drop the products that have not had any sales.
sns.scatterplot(data = prod[prod.purchase > 0], x = 'price', y = 'purchase', hue = 'price');
There is a clear decreasing relationship between price and number of purchases.
prod.view.sort_values(ascending = False)[0:20].plot.bar();
sns.scatterplot(data = prod, x = 'view', y = 'purchase');
Remove the outlier and zoom into the window of products with many views but few purchases.
sns.scatterplot(data = prod.loc[prod.view < 4000], x = 'view', y = 'purchase', hue = 'price')
plt.xlim(1000,3000)
plt.ylim(0,150)
One of the assets that can most increase an e-commerce site's sales is a recommender system.
We could already apply a basic one using the most-viewed and best-selling analyses carried out above.
But the real power comes when we build a recommender that personalizes for each purchase.
Types of recommender systems:
In our case we will develop one with item-based collaborative filtering.
The steps to follow are:
The KPI is the number of times each product has been purchased by the same user.
We will build the matrix user-product.
Due to computing constraints, we build the recommender system only for the 100 best-selling products.
most_sold = prod.sort_values('purchase', ascending = False).product_id[0:100]
temp = df.loc[df.product_id.isin(most_sold)]
temp
| event_type | product_id | category_id | price | user_id | user_session | date | year | month | day | hour | holiday | black_friday | valentin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| event_time | ||||||||||||||
| 2019-10-01 00:26:49+00:00 | purchase | 5815662 | 1487580006317032337 | 0.92 | 536128518 | a31f0991-645e-4472-a012-95eb2f814568 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 | 0 | 0 |
| 2019-10-01 00:46:20+00:00 | view | 5809912 | 1602943681873052386 | 5.24 | 555415545 | b9cc1771-9062-4e08-a3ad-363314cd17d8 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 | 0 | 0 |
| 2019-10-01 00:48:13+00:00 | view | 5816170 | 1602943681873052386 | 5.24 | 555415545 | b9cc1771-9062-4e08-a3ad-363314cd17d8 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 | 0 | 0 |
| 2019-10-01 00:52:39+00:00 | view | 5815730 | 1487580005092295511 | 10.95 | 555415545 | b9cc1771-9062-4e08-a3ad-363314cd17d8 | 2019-10-01 | 2019 | 10 | 1 | 0 | 0 | 0 | 0 |
| 2019-10-01 01:33:26+00:00 | view | 5849033 | 1487580005092295511 | 10.32 | 555456891 | b3239dc3-f107-4034-a507-4c41f646e38a | 2019-10-01 | 2019 | 10 | 1 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-02-29 23:11:44+00:00 | cart | 5793703 | 1487580006509970331 | 2.22 | 615102046 | 17b94398-0397-4c59-bc84-fe91dde0a8ec | 2020-02-29 | 2020 | 2 | 29 | 23 | 0 | 0 | 0 |
| 2020-02-29 23:12:40+00:00 | cart | 5550302 | 2195085255034011676 | 1.21 | 615102046 | 17b94398-0397-4c59-bc84-fe91dde0a8ec | 2020-02-29 | 2020 | 2 | 29 | 23 | 0 | 0 | 0 |
| 2020-02-29 23:12:50+00:00 | cart | 5809911 | 1602943681873052386 | 5.24 | 599909613 | 1c6c708d-135d-487b-afa9-4bbcfd28db4d | 2020-02-29 | 2020 | 2 | 29 | 23 | 0 | 0 | 0 |
| 2020-02-29 23:20:21+00:00 | view | 5816170 | 1602943681873052386 | 5.24 | 231719601 | a7467d5c-e848-406f-97f4-fcb6a4113e68 | 2020-02-29 | 2020 | 2 | 29 | 23 | 0 | 0 | 0 |
| 2020-02-29 23:58:49+00:00 | cart | 5815662 | 1487580006317032337 | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 2020-02-29 | 2020 | 2 | 29 | 23 | 0 | 0 | 0 |
168170 rows × 14 columns
We create the user-product matrix:
user_product = temp.loc[temp.event_type == 'purchase'].groupby(['user_id','product_id']).size().unstack(level = 1).fillna(0)
user_product
| product_id | 4497 | 4600 | 4768 | 4938 | 4958 | 5013 | 5304 | 5526 | 5528034 | 5528035 | ... | 5848909 | 5849033 | 5854812 | 5854897 | 5855332 | 5857007 | 5857360 | 5862564 | 5862943 | 5889300 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||||||||||||||
| 25392526 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50748978 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 74332980 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 80577370 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 88211255 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 621646584 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 621788730 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
| 621925941 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 621974977 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 622021687 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
5064 rows × 100 columns
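The groupby/size/unstack chain above can also be written with pd.crosstab, which builds the same counts matrix in one call. A minimal sketch on a hypothetical toy purchase log (the column names mirror the notebook's, the data is made up):

```python
import pandas as pd

# Hypothetical toy purchase log with the same columns as `temp`
purchases = pd.DataFrame({
    'user_id':    [1, 1, 2, 2, 2],
    'product_id': ['a', 'b', 'a', 'a', 'c'],
})

# The notebook's approach: groupby + size + unstack + fillna
m1 = purchases.groupby(['user_id', 'product_id']).size().unstack(level=1).fillna(0)

# Equivalent counts matrix with pd.crosstab
m2 = pd.crosstab(purchases['user_id'], purchases['product_id'])

print(m1.equals(m2.astype(float)))  # True (crosstab returns int counts)
```

Both variants count purchase events per (user, product) pair; crosstab just skips the intermediate fillna step because missing pairs are already filled with 0.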
As the distance metric, we use the Euclidean distance (the spatial.distance.euclidean function from SciPy):
from scipy import spatial
def recommender(dataframe):
    #distance from every column (product vector) to the given product
    def distance(product):
        return dataframe.apply(lambda x: spatial.distance.euclidean(x, product))
    #apply over every column: full product-product distance matrix
    return dataframe.apply(lambda x: distance(x))
product_product = recommender(user_product)
product_product
| product_id | 4497 | 4600 | 4768 | 4938 | 4958 | 5013 | 5304 | 5526 | 5528034 | 5528035 | ... | 5848909 | 5849033 | 5854812 | 5854897 | 5855332 | 5857007 | 5857360 | 5862564 | 5862943 | 5889300 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| product_id | |||||||||||||||||||||
| 4497 | 0.00 | 14.42 | 14.49 | 15.62 | 15.91 | 17.58 | 23.39 | 15.23 | 16.16 | 20.30 | ... | 15.72 | 24.45 | 14.97 | 26.10 | 15.30 | 15.07 | 15.78 | 15.17 | 16.40 | 15.33 |
| 4600 | 14.42 | 0.00 | 10.68 | 14.49 | 14.59 | 16.46 | 22.69 | 13.93 | 14.87 | 19.70 | ... | 14.39 | 23.87 | 13.49 | 25.75 | 13.78 | 13.60 | 14.46 | 13.86 | 14.73 | 13.89 |
| 4768 | 14.49 | 10.68 | 0.00 | 14.56 | 14.73 | 16.28 | 22.74 | 14.00 | 14.93 | 19.75 | ... | 14.18 | 24.00 | 13.27 | 25.63 | 13.86 | 13.75 | 14.39 | 13.86 | 14.80 | 14.11 |
| 4938 | 15.62 | 14.49 | 14.56 | 0.00 | 15.52 | 16.03 | 22.96 | 14.76 | 15.65 | 20.25 | ... | 15.13 | 24.17 | 14.35 | 25.87 | 14.76 | 14.59 | 15.46 | 14.97 | 15.72 | 15.13 |
| 4958 | 15.91 | 14.59 | 14.73 | 15.52 | 0.00 | 17.49 | 22.27 | 15.33 | 15.81 | 20.32 | ... | 15.49 | 24.35 | 14.66 | 26.15 | 14.87 | 14.97 | 15.94 | 15.26 | 16.12 | 15.03 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5857007 | 15.07 | 13.60 | 13.75 | 14.59 | 14.97 | 16.79 | 22.80 | 14.59 | 15.10 | 19.62 | ... | 14.49 | 23.98 | 13.89 | 25.61 | 14.32 | 0.00 | 14.83 | 14.46 | 15.17 | 14.42 |
| 5857360 | 15.78 | 14.46 | 14.39 | 15.46 | 15.94 | 16.06 | 23.24 | 15.39 | 16.12 | 20.32 | ... | 15.43 | 23.85 | 14.73 | 25.46 | 15.20 | 14.83 | 0.00 | 13.75 | 16.19 | 15.43 |
| 5862564 | 15.17 | 13.86 | 13.86 | 14.97 | 15.26 | 15.72 | 22.91 | 14.70 | 15.39 | 19.90 | ... | 15.00 | 23.32 | 14.00 | 25.12 | 14.35 | 14.46 | 13.75 | 0.00 | 15.39 | 14.66 |
| 5862943 | 16.40 | 14.73 | 14.80 | 15.72 | 16.12 | 17.66 | 23.62 | 15.33 | 16.00 | 20.42 | ... | 14.90 | 24.43 | 14.87 | 26.42 | 15.26 | 15.17 | 16.19 | 15.39 | 0.00 | 15.23 |
| 5889300 | 15.33 | 13.89 | 14.11 | 15.13 | 15.03 | 17.03 | 23.24 | 14.32 | 15.43 | 19.77 | ... | 14.90 | 23.90 | 14.04 | 25.69 | 14.04 | 14.42 | 15.43 | 14.66 | 15.23 | 0.00 |
100 rows × 100 columns
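The nested apply above computes one distance at a time; the same product-product matrix can be obtained in a single vectorized call with scipy.spatial.distance.cdist on the transposed user-product matrix. A sketch on a hypothetical 3×3 toy matrix (the real `user_product` would be used in its place):

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical toy stand-in for `user_product` (rows: users, columns: products)
user_product = pd.DataFrame(
    [[1, 0, 2],
     [0, 1, 0],
     [1, 1, 0]],
    index=[10, 20, 30],          # user_id
    columns=[100, 200, 300],     # product_id
)

# cdist over the transposed matrix: pairwise euclidean distances
# between product column vectors, same result as the nested apply
dist = cdist(user_product.T.values, user_product.T.values, metric='euclidean')
product_product = pd.DataFrame(dist,
                               index=user_product.columns,
                               columns=user_product.columns)

print(product_product.loc[100, 200])  # sqrt(2) ≈ 1.41
```

For 100 products this is a cosmetic speedup, but on the full catalog cdist avoids Python-level loops entirely and scales far better.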
The following function queries the above table each time a user views a product or puts it in the cart, and returns the N_products products with the smallest total distance to the input products.
#With multiple products, the input arrives from the web server as a semicolon-separated string
def prod_prioritizer(products, N_products = 5):
    #create an array with the input product ids
    array = np.int64(products.split(';'))
    #extract their columns from the product-product matrix
    matriz = product_product[array]
    #sum the distances to the input products
    sum_distances = matriz.sum(axis = 1)
    #drop the input products themselves
    sum_distances = sum_distances.loc[~sum_distances.index.isin(list(array))]
    #return the N_products with the smallest summed distance (euclidean: ascending = closest)
    return sum_distances.sort_values()[0:N_products]
We check it with a single product and with multiple products.
prod_prioritizer('4497')
product_id
5724230           14.39
4600              14.42
5550302           14.49
4768              14.49
5749149           14.56
dtype: float64
prod_prioritizer('4497;4600;4768')
product_id
5749149           40.25
5833318           40.47
5833335           40.81
5809303           40.81
5724230           41.00
dtype: float64
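The prioritizer logic can be sanity-checked on a tiny hypothetical distance matrix, where the expected ranking is obvious by inspection (the 3×3 matrix and product ids below are made up; np.array with dtype=np.int64 is used as an explicit equivalent of the id parsing above):

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 product-product distance matrix: 4600 is close
# to 4497 (distance 1.0), 4768 is far from it (distance 4.0)
product_product = pd.DataFrame(
    [[0.0, 1.0, 4.0],
     [1.0, 0.0, 2.0],
     [4.0, 2.0, 0.0]],
    index=pd.Index([4497, 4600, 4768], name='product_id'),
    columns=[4497, 4600, 4768],
)

def prod_prioritizer(products, N_products=5):
    #parse the semicolon-separated id string into an int64 array
    array = np.array(products.split(';'), dtype=np.int64)
    #sum distances to the input products, drop the inputs, keep the closest
    sum_distances = product_product[array].sum(axis=1)
    sum_distances = sum_distances.loc[~sum_distances.index.isin(list(array))]
    return sum_distances.sort_values()[0:N_products]

print(prod_prioritizer('4497').index.tolist())  # [4600, 4768]
```

With 4497 as the input, 4600 is recommended first and 4497 itself is excluded, which matches the intended behavior.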